In [1]:
import graphlab as gl

In [2]:
!head -n 2 ../data/yelp/yelp_training_set_review.json


{"votes": {"funny": 0, "useful": 5, "cool": 2}, "user_id": "rLtl8ZkDX5vH5nAx9C3q5Q", "review_id": "fWKvX83p0-ka4JS3dc6E5A", "stars": 5, "date": "2011-01-26", "text": "My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best \"toast\" I've ever had.\n\nAnyway, I can't wait to go back!", "type": "review", "business_id": "9yKzy9PApeiPPOUJEtnvkg"}
{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "0a2KyEL0d3Yb1V6aivbIuQ", "review_id": "IjZ33sJrzXqU-0X6U8NwyA", "stars": 5, "date": "2011-07-27", "text": "I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\n\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We were seated at 5:52 and the waiter came and got our drink orders. Everyone was very pleasant from the host that seated us to the waiter to the server. The prices were very good as well. We placed our orders once we decided what we wanted at 6:02. We shared the baked spaghetti calzone and the small \"Here's The Beef\" pizza so we can both try them. The calzone was huge and we got the smallest one (personal) and got the small 11\" pizza. Both were awesome! My friend liked the pizza better and I liked the calzone better. The calzone does have a sweetish sauce but that's how I like my sauce!\n\nWe had to box part of the pizza to take it home and we were out the door by 6:42. So, everything was great and not like these bad reviewers. That goes to show you that  you have to try these things yourself because all these bad reviewers have some serious issues.", "type": "review", "business_id": "ZRJwVLyzEJq1VAihDhYiow"}

SFrame -- Scalable Dataframe

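Powerful unstructured data processing: read raw JSON directly. Each line of the file is one JSON object, so read_csv with header=False loads every line into a single dict-typed column (X1).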


In [3]:
reviews = gl.SFrame.read_csv('../data/yelp/yelp_training_set_review.json', header=False)
reviews


[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-28139 - Server binary: /Users/alicez/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1443318283.log
[INFO] GraphLab Server Version: 1.6.1
PROGRESS: Finished parsing file /Users/alicez/Documents/training/Strata NYC 2015/data/yelp/yelp_training_set_review.json
PROGRESS: Parsing completed. Parsed 100 lines in 0.85514 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 55824 lines. Lines per second: 34072.2
PROGRESS: Finished parsing file /Users/alicez/Documents/training/Strata NYC 2015/data/yelp/yelp_training_set_review.json
PROGRESS: Parsing completed. Parsed 229907 lines in 4.56457 secs.
Out[3]:
+-----------------------------------------------------+
| X1                                                  |
+-----------------------------------------------------+
| {'votes': {'funny': 0, 'useful': 5, 'cool': 2}, ... |
| {'votes': {'funny': 0, 'useful': 0, 'cool': 0}, ... |
| {'votes': {'funny': 0, 'useful': 1, 'cool': 0}, ... |
| {'votes': {'funny': 0, 'useful': 2, 'cool': 1}, ... |
| {'votes': {'funny': 0, 'useful': 0, 'cool': 0}, ... |
| {'votes': {'funny': 1, 'useful': 3, 'cool': 4}, ... |
| {'votes': {'funny': 4, 'useful': 7, 'cool': 7}, ... |
| {'votes': {'funny': 0, 'useful': 1, 'cool': 0}, ... |
| {'votes': {'funny': 0, 'useful': 0, 'cool': 0}, ... |
| {'votes': {'funny': 0, 'useful': 1, 'cool': 0}, ... |
+-----------------------------------------------------+
[229907 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
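
For readers without GraphLab at hand, the effect of loading a JSON-lines file into one dict column can be sketched with the standard library. This is only an illustration, not how read_csv is implemented; the two records below are shortened, made-up stand-ins for lines of the Yelp file.

```python
import json

# Two abbreviated records in the same one-object-per-line layout as the Yelp file
# (fields shortened for illustration; these are not the full reviews).
raw_lines = [
    '{"votes": {"funny": 0, "useful": 5, "cool": 2}, "stars": 5, "type": "review"}',
    '{"votes": {"funny": 0, "useful": 0, "cool": 0}, "stars": 5, "type": "review"}',
]

# One dict per line, analogous to the single dict-typed column X1.
records = [json.loads(line) for line in raw_lines]
print(records[0]["votes"]["useful"])
```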


In [4]:
reviews[0]


Out[4]:
{'X1': {'business_id': '9yKzy9PApeiPPOUJEtnvkg',
  'date': '2011-01-26',
  'review_id': 'fWKvX83p0-ka4JS3dc6E5A',
  'stars': 5,
  'text': 'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!',
  'type': 'review',
  'user_id': 'rLtl8ZkDX5vH5nAx9C3q5Q',
  'votes': {'cool': 2, 'funny': 0, 'useful': 5}}}

Unpack to extract structure


In [5]:
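# Expand the dict column X1 into one column per key.
# The empty second argument is the column-name prefix, so columns are named by key alone.
reviews = reviews.unpack('X1', '')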
reviews


Out[5]:
+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+
| business_id            | date       | review_id              | stars | text                                                | type   |
+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...   | review |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5     | I have no idea why some people give bad reviews ... | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ... | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ... | review |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5     | General Manager Scott Petello is a good egg!!! ...  | review |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 | m2CKSsepBCoRYWxiRUsxAg | 4     | Quiessence is, simply put, beautiful. Full ...      | review |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 | riFQ3vxNpP4rWLk_CSri2A | 5     | Drop what you're doing and drive here. After I ...  | review |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 | JL7GXJ9u4YMx7Rzs05NfiQ | 4     | Luckily, I didn't have to travel far to make my ... | review |
| wNUea3IXZWD63bbOQaOH-g | 2012-08-17 | XtnfnYmnJYi71yIuGsXIUA | 4     | Definitely come for Happy hour! Prices are amaz ... | review |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 | jJAIXA46pU1swYyRCdfXtQ | 5     | Nobuo shows his unique talents with everything ...  | review |
+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+
+------------------------+------------------------------------------+
| user_id                | votes                                    |
+------------------------+------------------------------------------+
| rLtl8ZkDX5vH5nAx9C3q5Q | {'funny': 0, 'useful': 5, 'cool': 2} ... |
| 0a2KyEL0d3Yb1V6aivbIuQ | {'funny': 0, 'useful': 0, 'cool': 0} ... |
| 0hT2KtfLiobPvh6cDC8JQg | {'funny': 0, 'useful': 1, 'cool': 0} ... |
| uZetl9T0NcROGOyFfughhg | {'funny': 0, 'useful': 2, 'cool': 1} ... |
| vYmM4KTsC8ZfQBg-j5MWkw | {'funny': 0, 'useful': 0, 'cool': 0} ... |
| sqYN3lNgvPbPCTRsMFu27g | {'funny': 1, 'useful': 3, 'cool': 4} ... |
| wFweIWhv2fREZV_dYkz_1g | {'funny': 4, 'useful': 7, 'cool': 7} ... |
| 1ieuYcKS7zeAv_U15AB13A | {'funny': 0, 'useful': 1, 'cool': 0} ... |
| Vh_DlizgGhSqQh4qfZ2h6A | {'funny': 0, 'useful': 0, 'cool': 0} ... |
| sUNkXg8-KFtCMQDV6zRzQg | {'funny': 0, 'useful': 1, 'cool': 0} ... |
+------------------------+------------------------------------------+
[229907 rows x 8 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
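
What unpacking does to a dict-valued column can be sketched in plain Python. The helper below is hypothetical and written only for illustration; it is not GraphLab's implementation, and the sample row is a shortened stand-in for a real review.

```python
def unpack(rows, column, prefix=""):
    """Replace a dict-valued column with one column per key (hypothetical helper)."""
    out = []
    for row in rows:
        # Keep every other column unchanged.
        flat = {k: v for k, v in row.items() if k != column}
        # Promote each key of the nested dict to its own column.
        for key, value in row[column].items():
            flat[prefix + key] = value
        out.append(flat)
    return out

rows = [{"X1": {"stars": 5, "votes": {"funny": 0, "useful": 5, "cool": 2}}}]
rows = unpack(rows, "X1")     # columns are now: stars, votes
rows = unpack(rows, "votes")  # columns are now: stars, funny, useful, cool
print(rows[0])
```

Applying the helper twice mirrors the two unpack calls in this notebook: first the outer record, then the nested votes dictionary.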

Votes are still crammed in a dictionary. Let's unpack it.


In [6]:
reviews = reviews.unpack('votes', '')
reviews


Out[6]:
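+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+
| business_id            | date       | review_id              | stars | text                                                | type   |
+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...   | review |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5     | I have no idea why some people give bad reviews ... | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ... | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ... | review |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5     | General Manager Scott Petello is a good egg!!! ...  | review |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 | m2CKSsepBCoRYWxiRUsxAg | 4     | Quiessence is, simply put, beautiful. Full ...      | review |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 | riFQ3vxNpP4rWLk_CSri2A | 5     | Drop what you're doing and drive here. After I ...  | review |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 | JL7GXJ9u4YMx7Rzs05NfiQ | 4     | Luckily, I didn't have to travel far to make my ... | review |
| wNUea3IXZWD63bbOQaOH-g | 2012-08-17 | XtnfnYmnJYi71yIuGsXIUA | 4     | Definitely come for Happy hour! Prices are amaz ... | review |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 | jJAIXA46pU1swYyRCdfXtQ | 5     | Nobuo shows his unique talents with everything ...  | review |
+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+
+------------------------+------+-------+--------+
| user_id                | cool | funny | useful |
+------------------------+------+-------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | 2    | 0     | 5      |
| 0a2KyEL0d3Yb1V6aivbIuQ | 0    | 0     | 0      |
| 0hT2KtfLiobPvh6cDC8JQg | 0    | 0     | 1      |
| uZetl9T0NcROGOyFfughhg | 1    | 0     | 2      |
| vYmM4KTsC8ZfQBg-j5MWkw | 0    | 0     | 0      |
| sqYN3lNgvPbPCTRsMFu27g | 4    | 1     | 3      |
| wFweIWhv2fREZV_dYkz_1g | 7    | 4     | 7      |
| 1ieuYcKS7zeAv_U15AB13A | 0    | 0     | 1      |
| Vh_DlizgGhSqQh4qfZ2h6A | 0    | 0     | 0      |
| sUNkXg8-KFtCMQDV6zRzQg | 0    | 0     | 1      |
+------------------------+------+-------+--------+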
[229907 rows x 10 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Quick data visualization


In [7]:
reviews.show()


Canvas is accessible via web browser at the URL: http://localhost:52499/index.html
Opening Canvas in default web browser.

Represent datetime


In [8]:
reviews['date'] = reviews['date'].str_to_datetime(str_format='%Y-%m-%d')
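
The format string uses the usual strptime directives. As a standard-library illustration of what the conversion does to a single value (this is not the SArray method itself):

```python
from datetime import datetime

# Parse one date string with the same '%Y-%m-%d' format used above.
parsed = datetime.strptime("2011-01-26", "%Y-%m-%d")
print(parsed)  # 2011-01-26 00:00:00
```

The time part defaults to midnight, which is why later outputs in this notebook show dates such as 2011-01-26 00:00:00.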

Munge votes and add a new column


In [9]:
reviews['total_votes'] = reviews['funny'] + reviews['cool'] + reviews['useful']
reviews


Out[9]:
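+------------------------+---------------------+------------------------+-------+-----------------------------------------------------+--------+
| business_id            | date                | review_id              | stars | text                                                | type   |
+------------------------+---------------------+------------------------+-------+-----------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 00:00:00 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...   | review |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 00:00:00 | IjZ33sJrzXqU-0X6U8NwyA | 5     | I have no idea why some people give bad reviews ... | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 00:00:00 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ... | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 00:00:00 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ... | review |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 00:00:00 | 1uJFq2r5QfJG_6ExMRCaGw | 5     | General Manager Scott Petello is a good egg!!! ...  | review |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 00:00:00 | m2CKSsepBCoRYWxiRUsxAg | 4     | Quiessence is, simply put, beautiful. Full ...      | review |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 00:00:00 | riFQ3vxNpP4rWLk_CSri2A | 5     | Drop what you're doing and drive here. After I ...  | review |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 00:00:00 | JL7GXJ9u4YMx7Rzs05NfiQ | 4     | Luckily, I didn't have to travel far to make my ... | review |
| wNUea3IXZWD63bbOQaOH-g | 2012-08-17 00:00:00 | XtnfnYmnJYi71yIuGsXIUA | 4     | Definitely come for Happy hour! Prices are amaz ... | review |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 00:00:00 | jJAIXA46pU1swYyRCdfXtQ | 5     | Nobuo shows his unique talents with everything ...  | review |
+------------------------+---------------------+------------------------+-------+-----------------------------------------------------+--------+
+------------------------+------+-------+--------+-------------+
| user_id                | cool | funny | useful | total_votes |
+------------------------+------+-------+--------+-------------+
| rLtl8ZkDX5vH5nAx9C3q5Q | 2    | 0     | 5      | 7           |
| 0a2KyEL0d3Yb1V6aivbIuQ | 0    | 0     | 0      | 0           |
| 0hT2KtfLiobPvh6cDC8JQg | 0    | 0     | 1      | 1           |
| uZetl9T0NcROGOyFfughhg | 1    | 0     | 2      | 3           |
| vYmM4KTsC8ZfQBg-j5MWkw | 0    | 0     | 0      | 0           |
| sqYN3lNgvPbPCTRsMFu27g | 4    | 1     | 3      | 8           |
| wFweIWhv2fREZV_dYkz_1g | 7    | 4     | 7      | 18          |
| 1ieuYcKS7zeAv_U15AB13A | 0    | 0     | 1      | 1           |
| Vh_DlizgGhSqQh4qfZ2h6A | 0    | 0     | 0      | 0           |
| sUNkXg8-KFtCMQDV6zRzQg | 0    | 0     | 1      | 1           |
+------------------------+------+-------+--------+-------------+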
[229907 rows x 11 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Filter rows to remove reviews with no votes


In [10]:
reviews['total_votes'] > 0


Out[10]:
dtype: int
Rows: 229907
[1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, ... ]

In [11]:
reviews = reviews[reviews['total_votes'] > 0]
reviews


Out[11]:
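+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+
| business_id            | date                | review_id              | stars | text                                                  | type   |
+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 00:00:00 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...     | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 00:00:00 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ...   | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 00:00:00 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | review |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 00:00:00 | m2CKSsepBCoRYWxiRUsxAg | 4     | Quiessence is, simply put, beautiful. Full ...        | review |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 00:00:00 | riFQ3vxNpP4rWLk_CSri2A | 5     | Drop what you're doing and drive here. After I ...    | review |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 00:00:00 | JL7GXJ9u4YMx7Rzs05NfiQ | 4     | Luckily, I didn't have to travel far to make my ...   | review |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 00:00:00 | jJAIXA46pU1swYyRCdfXtQ | 5     | Nobuo shows his unique talents with everything ...    | review |
| AsSCv0q_BWqIe3mX2JqsOQ | 2010-06-16 00:00:00 | E11jzpKz9Kw5K7fuARWfRw | 5     | The oldish man who owns the store is as sweet as ...  | review |
| e9nN4XxjdHj4qtKCOPq_vg | 2011-10-21 00:00:00 | 3rPt0LxF7rgmEUrznoH22w | 5     | Wonderful Vietnamese sandwich shoppe. Their ...       | review |
| h53YuCiIDfEFSJCQpk8v1g | 2010-01-11 00:00:00 | cGnKNX3I9rthE0-TH24-qA | 5     | They have a limited time thing going on right now ... | review |
+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+
+------------------------+------+-------+--------+-------------+
| user_id                | cool | funny | useful | total_votes |
+------------------------+------+-------+--------+-------------+
| rLtl8ZkDX5vH5nAx9C3q5Q | 2    | 0     | 5      | 7           |
| 0hT2KtfLiobPvh6cDC8JQg | 0    | 0     | 1      | 1           |
| uZetl9T0NcROGOyFfughhg | 1    | 0     | 2      | 3           |
| sqYN3lNgvPbPCTRsMFu27g | 4    | 1     | 3      | 8           |
| wFweIWhv2fREZV_dYkz_1g | 7    | 4     | 7      | 18          |
| 1ieuYcKS7zeAv_U15AB13A | 0    | 0     | 1      | 1           |
| sUNkXg8-KFtCMQDV6zRzQg | 0    | 0     | 1      | 1           |
| -OMlS6yWkYjVldNhC31wYg | 1    | 1     | 3      | 5           |
| C1rHp3dmepNea7XiouwB6Q | 1    | 0     | 1      | 2           |
| UPtysDF6cUDUxq2KY-6Dcg | 1    | 0     | 2      | 3           |
+------------------------+------+-------+--------+-------------+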
[? rows x 11 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.
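
The comparison produces a 0/1 mask, and indexing with that mask keeps only the rows marked 1. A plain-Python sketch of the same idea, using the total_votes values of the first five reviews shown earlier:

```python
# total_votes for the first five reviews in the table above.
total_votes = [7, 0, 1, 3, 0]

# Element-wise comparison gives a 0/1 mask, like Out[10].
mask = [int(v > 0) for v in total_votes]

# Keep only the entries whose mask value is 1.
kept = [v for v, keep in zip(total_votes, mask) if keep]
print(mask, kept)
```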

Classification task

Predict which reviews will be voted "funny," based on review text.

First, the labels. A review with at least one "funny" vote is labeled funny.


In [12]:
reviews['funny'] = reviews['funny'] > 0

In [13]:
reviews = reviews[['text','funny']]
reviews


Out[13]:
+-------------------------------------------------------+-------+
| text                                                  | funny |
+-------------------------------------------------------+-------+
| My wife took me here on my birthday for break ...     | 0     |
| love the gyro plate. Rice is so good and I also ...   | 0     |
| Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | 0     |
| Quiessence is, simply put, beautiful. Full ...        | 1     |
| Drop what you're doing and drive here. After I ...    | 1     |
| Luckily, I didn't have to travel far to make my ...   | 0     |
| Nobuo shows his unique talents with everything ...    | 0     |
| The oldish man who owns the store is as sweet as ...  | 1     |
| Wonderful Vietnamese sandwich shoppe. Their ...       | 0     |
| They have a limited time thing going on right now ... | 0     |
+-------------------------------------------------------+-------+
[147084 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

To save time, take just a small subset


In [14]:
reviews = reviews[:10000]

Create bag-of-words representation of text


In [15]:
word_delims = ["\r", "\v", "\n", "\f", "\t", " ", 
               '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=', 
               ',', '.', ';', ':', '\"', '?', '|', '\\', '/', 
               '<', '>', '(', ')', '[', ']', '{', '}']

reviews['bow'] = gl.text_analytics.count_words(reviews['text'], delimiters=word_delims)
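
A rough standard-library sketch of delimiter-based word counting follows. It is not GraphLab's implementation. Lowercasing is an assumption here, chosen because the keys in the output below are lowercase; the function name simply mirrors the call above.

```python
import re
from collections import Counter

word_delims = ["\r", "\v", "\n", "\f", "\t", " ",
               '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=',
               ',', '.', ';', ':', '\"', '?', '|', '\\', '/',
               '<', '>', '(', ')', '[', ']', '{', '}']

def count_words(text, delimiters):
    """Split on any run of delimiter characters and count the lowercase tokens."""
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    tokens = re.split(pattern, text.lower())
    # Leading or trailing delimiters produce empty strings; drop them.
    return dict(Counter(t for t in tokens if t))

bag = count_words("Rice is so good, and the rice is fresh!", word_delims)
print(bag)
```

The apostrophe is not in the delimiter list, so a contraction such as "didn't" stays a single token, which matches the "didn't" key visible in the bag-of-words output.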

Create tf-idf representation of the bag of words


In [16]:
reviews['tf_idf'] = gl.text_analytics.tf_idf(reviews['bow'])

In [17]:
# Each tf_idf value is a dict keyed by 'docs'; keep only the inner word-to-weight dictionary.
reviews['tf_idf'] = reviews['tf_idf'].apply(lambda x: x['docs'])
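
As a hedged sketch of the weighting, assume the common form tf * log(N / df), where N is the number of documents and df is the number of documents containing the word. The exact formula GraphLab uses is not shown in this notebook, but the output below is consistent with a weight proportional to the raw count: 'and' with count 8 gets 8 times the weight of 'and' with count 1. The toy bags are made up for illustration.

```python
import math

def tf_idf(bags):
    """Weight each word count by log(N / df); an assumed, textbook form of TF-IDF."""
    n_docs = len(bags)
    doc_freq = {}
    for bag in bags:
        for word in bag:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {w: c * math.log(n_docs / float(doc_freq[w])) for w, c in bag.items()}
        for bag in bags
    ]

# Toy bags of words (illustrative only).
bags = [{"and": 1, "toast": 1}, {"and": 8, "park": 2}, {"gyro": 1}]
weights = tf_idf(bags)
print(weights[0])
```

Words that occur in most documents, such as "and", end up with weights near zero, while rarer words are weighted up.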

In [18]:
reviews


Out[18]:
+-------------------------------------------------------+-------+-----------------------------------------------------+---------------------------------------+
| text                                                  | funny | bow                                                 | tf_idf                                |
+-------------------------------------------------------+-------+-----------------------------------------------------+---------------------------------------+
| My wife took me here on my birthday for break ...     | 0     | {'anyway': 1, 'looks': 1, 'go': 1, 'toast': 1, ...  | {'anyway': 3.564893474332945, ...     |
| love the gyro plate. Rice is so good and I also ...   | 0     | {'and': 1, 'plate': 1, 'selection': 1, 'love': ...  | {'and': 0.08621169681906551, ...      |
| Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | 0     | {'and': 8, 'does': 1, 'all': 1, 'surrounded': ...   | {'and': 0.6896935745525241, ...       |
| Quiessence is, simply put, beautiful. Full ...        | 1     | {'45': 1, 'seated': 1, 'just': 1, 'bring': 1, ...   | {'just': 1.0552654841647438, ...      |
| Drop what you're doing and drive here. After I ...    | 1     | {'cute': 1, 'condesa': 1, 'desolate': 1, 'mexic ... | {'cute': 3.575550768806933, ...       |
| Luckily, I didn't have to travel far to make my ...   | 0     | {'and': 2, 'presence': 1, "didn't": 1, 'as': 1, ... | {'and': 0.17242339363813103, ...      |
| Nobuo shows his unique talents with everything ...    | 0     | {'and': 1, 'pork': 1, 'features': 1, 'go': 1, ...   | {'and': 0.08621169681906551, ...      |
| The oldish man who owns the store is as sweet as ...  | 1     | {'and': 1, 'cookies': 2, 'sweet': 1, 'is': 1, ...   | {'and': 0.08621169681906551, ...      |
| Wonderful Vietnamese sandwich shoppe. Their ...       | 0     | {'selection': 1, 'have': 2, 'baguettes': 1, ...     | {'selection': 2.6882475738060303, ... |
| They have a limited time thing going on right now ... | 0     | {'and': 1, 'limited': 1, 'all': 1, 'on': 1, ...     | {'and': 0.08621169681906551, ...      |
+-------------------------------------------------------+-------+-----------------------------------------------------+---------------------------------------+
[10000 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Create a train-test split

Returns immediately because SFrame operations are lazily evaluated.


In [19]:
train_sf, test_sf = reviews.random_split(0.8)
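
The split is approximate rather than exact: the training logs in this notebook report 7992 training examples out of 10000, not 8000. One way to get that behavior is to assign each row independently with probability 0.8, sketched below in plain Python. Whether GraphLab does exactly this internally is an assumption; the helper and its seed argument are illustrative.

```python
import random

def random_split(rows, fraction, seed=None):
    """Send each row to the first part with probability `fraction` (illustrative helper)."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

train, test = random_split(list(range(10000)), 0.8, seed=0)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.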

Train classifiers on bow and tf-idf

Dictionaries are automatically interpreted as sparse features.

Not demonstrated here, but any string/categorical columns are automatically interpreted as sparse features as well.
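
To make "sparse features" concrete: each distinct dictionary key behaves like its own feature, and a key missing from a row contributes nothing. The sketch below scores one bag of words with a logistic model. The coefficient values are invented for illustration and are not taken from the trained models.

```python
import math

def predict_proba(features, coefficients, intercept=0.0):
    """Logistic probability from a sparse dict of features (illustrative helper)."""
    # Only keys present in the row contribute; unknown words get a zero coefficient.
    score = intercept + sum(coefficients.get(w, 0.0) * v for w, v in features.items())
    return 1.0 / (1.0 + math.exp(-score))

coefficients = {"hilarious": 1.2, "parking": -0.4}  # hypothetical weights
p = predict_proba({"hilarious": 2, "the": 5}, coefficients)
print(p)
```

This is also why the training logs report one more coefficient than unpacked features: one weight per distinct word plus an intercept.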


In [20]:
m1 = gl.logistic_classifier.create(train_sf, 
                                   'funny', 
                                   features=['bow'], 
                                   validation_set=None, 
                                   feature_rescaling=False)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 7992
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 30012
PROGRESS: Number of coefficients    : 30013
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+
PROGRESS: | 1         | 6        | 0.000002  | 1.246551     | 0.474099          |
PROGRESS: | 2         | 9        | 5.000000  | 1.400293     | 0.539039          |
PROGRESS: | 3         | 10       | 5.000000  | 1.475651     | 0.588088          |
PROGRESS: | 4         | 11       | 5.000000  | 1.547559     | 0.522523          |
PROGRESS: | 5         | 13       | 1.000000  | 1.654165     | 0.592593          |
PROGRESS: | 6         | 14       | 1.000000  | 1.719261     | 0.564565          |
PROGRESS: | 10        | 19       | 1.000000  | 2.110180     | 0.622873          |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+

In [21]:
m2 = gl.logistic_classifier.create(train_sf, 
                                   'funny', 
                                   features=['tf_idf'], 
                                   validation_set=None, 
                                   feature_rescaling=False)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 7992
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 30012
PROGRESS: Number of coefficients    : 30013
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+
PROGRESS: | 1         | 6        | 0.000011  | 0.383228     | 0.481231          |
PROGRESS: | 2         | 9        | 5.000000  | 0.620402     | 0.678929          |
PROGRESS: | 3         | 10       | 5.000000  | 0.731355     | 0.736862          |
PROGRESS: | 4         | 15       | 0.082708  | 1.089782     | 0.750375          |
PROGRESS: | 5         | 16       | 0.082708  | 1.211477     | 0.750501          |
PROGRESS: | 6         | 17       | 0.082708  | 1.371576     | 0.760511          |
PROGRESS: | 10        | 21       | 0.082708  | 1.827941     | 0.793293          |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+

Evaluate on the held-out test set and compare performance


In [22]:
m1_res = m1.evaluate(test_sf)
m1_res


Out[22]:
{'accuracy': 0.6055776892430279, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |  266  |
 |      0       |        0        |  778  |
 |      1       |        1        |  438  |
 |      1       |        0        |  526  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}

In [23]:
m2_res = m2.evaluate(test_sf)
m2_res


Out[23]:
{'accuracy': 0.6170318725099602, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |  351  |
 |      0       |        0        |  693  |
 |      1       |        1        |  546  |
 |      1       |        0        |  418  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
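
The reported accuracies can be recomputed from the confusion matrix counts, and the same counts give precision and recall for the funny class. The helper name below is just for illustration; the counts are copied from the two outputs above.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy, precision and recall for the positive (funny) class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / float(total),
        "precision": tp / float(tp + fp),
        "recall": tp / float(tp + fn),
    }

# Counts from the confusion matrices of m1 (bow) and m2 (tf_idf).
m1_summary = summarize(tn=778, fp=266, fn=526, tp=438)
m2_summary = summarize(tn=693, fp=351, fn=418, tp=546)
print(m1_summary)
print(m2_summary)
```

On this split the tf-idf model finds more of the funny reviews (higher recall) at the cost of more false positives.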

Baseline accuracy (what if we classify everything as the majority class?)

Fraction of 'funny' reviews:


In [25]:
float(test_sf['funny'].sum())/test_sf.num_rows()


Out[25]:
0.4800796812749004

Fraction of reviews that are not funny. This is the majority class, so it is also the baseline accuracy:


In [26]:
1.0 - float(test_sf['funny'].sum())/test_sf.num_rows()


Out[26]:
0.5199203187250996
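
The same baseline can be computed directly from the label counts. The sketch below rebuilds the test labels from the confusion matrix totals (1044 not funny, 964 funny); the helper name is illustrative.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))

# Test-set label counts taken from the confusion matrices above.
labels = [0] * 1044 + [1] * 964
baseline = majority_baseline(labels)
print(baseline)
```

Both classifiers (about 0.606 and 0.617 accuracy) beat this baseline of about 0.520, by roughly 9 and 10 percentage points.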
